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ABSTRACT 

Diffusion  geometry  techniques  are  useful  to  classify  patterns  and  visualize  high-dimensional 
datasets.  Building  upon  ideas  from  diffusion  geometry,  we  outline  our  mathematical  foun¬ 
dations  for  learning  a  function  on  high-dimension  biomedical  data  in  a  local  fashion  from 
training  data.  Our  approach  is  based  on  a  localized  summation  kernel,  and  we  verify  the 
computational  performance  by  means  of  exact  approximation  rates.  After  these  theoretical 
results,  we  apply  our  scheme  to  learn  early  disease  stages  in  standard  and  new  biomedical 
datasets. 
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1.  INTRODUCTION 

As  personalized  medicine  expands,  increasingly  detailed  biomedical  data  must  be  integrated  to  better 
understand  normal  function  and  evolution  of  multifactorial  chronic  disease  in  clinical  trials  and  indi¬ 
vidual  treatment  decisions.  The  complexity  of  molecular,  cellular,  and  tissue  interactions,  however,  is  a 
fundamental  barrier  to  extracting  the  complicated  relationships  that  underlie  human  physiology. 

A  recent  idea,  originating  in  computational  harmonic  analysis,  is  to  let  the  data  speak  for  itself.  In  this 
approach,  one  deals  typically  with  high-dimensional,  unstructured  data.  In  theoretical  analysis,  one  assumes 
that  the  data  represents  a  sample  from  some  unknown  low-dimensional  manifold  embedded  in  a  high¬ 
dimensional  ambient  Euclidean  space.  The  objective  is  then  to  understand  the  geometry  of  this  manifold. 
Thus,  statistical  techniques  have  been  devised  to  estimate  the  dimension  of  this  manifold  (Costa  and  Hero, 
2004).  A  simulation  of  Brownian  motion  is  expected  to  reveal  the  relative  neighborhoods  of  different  data 
points,  as  well  as  provide  local  coordinate  systems  for  the  manifold  (Jones  et  al..  2010;  Lafon,  2004).  See 
the  special  issue  (Chui  and  Donoho,  2006)  for  an  introduction  to  these  ideas.  Related  analysis  of  graphs  is 
discussed  in  Pesenson  and  Pesenson  (2010). 

In  many  practical  applications,  one  needs  to  go  beyond  an  understanding  of  the  manifold  and  answer 
queries  based  on  the  data.  These  queries  can  be  modeled  mathematically  as  functions  in  the  (unknown) 
manifold.  This  function  may  be  known  to  us  on  few  training  points,  and  we  aim  to  accurately  predict  the 
value  of  the  function  of  items  that  are  not  yet  observed.  Models  for  this  have  been  developed  as  eigenmaps/ 
diffusion  maps  (Coifman  et  al.,  2005a),  multiscale  approaches  (Coifman  et  al.,  2005b;  Gavish  et  al.,  2010; 
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Nadler  and  Galun,  2007),  and  nonlinear  dimension  reduction  (Belkin  and  Niyogi,  2003;  Chui  and  Wang, 
2010;  Roweis  and  Saul,  2000;  Saito,  2008;  Tenenbaum  et  al.,  2000;  Wu  et  al.,  2010). 

Necessarily,  any  such  model  cannot  be  expected  to  yield  perfect  reproduction  of  the  actual  target 
function.  The  subject  of  approximation  theory  deals  with  the  intrinsic  errors  inherent  in  constructing 
different  kinds  of  models  for  the  target  function.  In  traditional  scenarios,  the  accuracy  of  approximation  is 
closely  related  to  the  smoothness  of  the  function.  Because  of  this  history,  many  experts  in  approximation 
theory  nowadays  consider  the  accuracy  of  approximation  itself  to  be  a  measurement  of  smoothness.  This 
viewpoint  is  particularly  useful  in  our  setting,  where  the  manifold  is  unknown  and,  therefore,  it  is  im¬ 
possible  to  define  the  smoothness  in  a  classical  manner. 

The  last  named  author  and  his  collaborators  have  developed  approximation  theory  tools  applicable  in  the 
current  context  in  a  series  of  papers  (Filbir  and  Mhaskar,  2010;  Filbir  and  Mhaskar,  2011;  Maggioni  and 
Mhaskar,  2008;  Mhaskar,  2010;  Mhaskar,  2011).  A  particularly  interesting  aspect  of  this  theory  is  a 
definition  of  pointwise  smoothness  of  the  target  function.  The  research  has  also  enabled  us  to  devise 
specific  algorithms,  extending  the  theory  developed  for  the  understanding  of  geometry,  with  the  property 
that  the  rate  of  convergence  of  these  algorithms  in  neighborhoods  of  different  points  completely  charac¬ 
terize  local  smoothness  properties  of  the  target  function  at  those  points. 

Such  local  smoothness  ideas  are  particularly  useful  in  biomedicine,  as  disease  progression  underlies 
natural  variations,  medication  leads  to  abrupt  changes  in  disease  progression,  and  environmental  factors 
vary  quickly,  so  that  the  query  function  might  not  be  globally  smooth.  While  late  disease  stages  underlie 
large  variations,  the  transition  from  healthy  to  early  pathology  can  be  smooth,  leading  to  query  functions 
that  are  locally  smooth  within  such  early  disease  transitions. 

In  this  article,  we  shall  review  and  further  develop  the  salient  features  of  diffusion  geometry  and 
approximation  theory  needed  to  “learn  locally”  from  the  acquired  data.  In  contrast  to  common  clustering 
methods  used  in  biomedicine,  we  explicitly  use  that  the  clusters  represent  disease  stages,  i.e.,  are  ordered 
quantitatively  in  a  progressing  fashion.  Thus,  some  unspecific  ordering  is  a-priori  fed  into  the  clustering  process 
and  is  specified  in  the  final  result.  Moreover,  instead  of  interpolating  on  the  training  data,  which  usually  leads  to 
instabilities  with  large  training  sets,  we  allow  our  algorithm  to  correct  misclassified  training  points. 

The  outline  is  as  follows.  In  Section  2,  we  briefly  discuss  two  approaches  to  reconstructing  the  query 
function  from  training  data.  In  Section  3,  we  present  our  local  learning  approach.  The  numerical  im¬ 
plementation  is  discussed  in  Section  4.  In  Section  5,  we  outline  our  scheme  for  the  special  case  in  which  the 
manifold  is  the  sphere.  We  apply  our  methods  to  analyze  several  biomedical  datasets  in  Section  6. 


2.  APPROXIMATION  ON  UNKNOWN  MANIFOLDS 

Let  X  be  the  underlying  but  unknown  compact  manifold,  endowed  with  a  probability  measure  /(.  The 
training  data  C={y,}fl1  C  X  yield  pairs  of  the  form  {(yj,  Zj)}fL  \ ,  with  z,j  ~  f(y y)  for  the  as  yet  unknown 
function  /.  This  defines  /  only  on  C.  The  objective  is  to  extend  this  function  to  the  entire  manifold  X, 
including  the  data  already  observed  and  the  data  not  yet  available.  In  deterministic  analysis,  we  do  not  deal 
with  the  noise  explicitly,  although  some  probabilistic  estimates  can  be  given  in  addition  to  deterministic 
guarantees  that  account  for  the  noise  (Gia  and  Mhaskar,  2008a). 

There  are  two  common  approaches  to  solving  this  problem.  In  the  first  approach,  one  finds  the  extension 
fc  as  a  solution  of  some  minimization/regularization  problem,  for  example, 

fc=  arg  min{||g-z||+<S||g||jr},  (1) 

where  <5  is  a  balancing  parameter,  T  is  a  suitable  class  of  functions  to  choose  the  modeling  function  g  from, 
g  denotes  the  vector  (g(yi),  •  •  • ,  giyiu)),  z  denotes  the  vector  (z,t ,  . . . ,  Zm),  ||  •  ||  is  some  norm  on,  UM,  and 
||  •  ||jr  is  a  penalty  functional,  commonly  the  norm  on  T.  We  will  call  Equation  (1)  the  optimization 
approach. 

The  other  approach  is  to  imagine  a  priori  that  there  is  an  underlying  unknown  function /on  X,  so  that 
f(yi)  =  Zi,  i=  1,  •  •  • ,  M.  We  then  seek  a  model  P  £  T  so  that  for  some  performance  guarantee  ec, 

\\f-P\L<ec\\f\\r, 

where  ||  •  1 1  denotes  the  supremum  norm.  We  will  call  Equation  (2)  the  approximation  approach. 
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The  optimization  approach  has  the  advantage  that  the  penalty  functional  may  be  chosen  to  reflect  some 
domain-specific  knowledge  about  the  target  function.  Also,  even  if  one  does  not  expect  any  function 
underlying  the  phenomenon,  one  gets  some  smooth  model  to  work  with.  On  the  other  hand,  when  we  do  not 
know  with  absolute  certainty  any  physical  model  that  underlies  the  data,  and  are  seeking  a  function  on  X  as 
a  model  anyway,  then  it  is  more  natural  to  assume  that  there  is  an  underlying  function  from  which  the  data 
is  sampled,  even  though  the  function  itself  is  yet  unknown,  so  that  the  approximation  approach  seems  more 
natural. 

We  noted  some  comparisons  between  the  two  approaches.  First,  we  the  optimization  approach  does  not 
necessarily  imply  any  performance  guarantees  of  Equation  (2).  Moreover,  the  value  of  the  regularization 
functional  depends  upon  the  data.  There  are  no  bounds  to  how  large  this  value  might  get  as  more  and  more 
data  are  introduced.  Finally,  there  are  common  computational  issues  such  as  local  minima,  convergence  of 
the  algorithms,  and  convergence  of  the  minimizers  /-as  the  data  becomes  dense  on  the  manifold.  All  of 
these  issues  are  completely  avoided  in  the  approximation  approach. 

We  will  demonstrate  below  that  in  the  approximation  approach,  we  can  construct  a  linear  operator  with 
mathematical  performance  guarantees  of  Equation  (2).  We  do  not  need  to  solve  any  minimization 
problem,  so  that  all  the  computational  issues  mentioned  above  are  avoided  altogether.  Moreover,  we  can 
design  this  operator  in  a  manner  that  its  performance  guarantees  are  automatically  better  on  regions  of  X 
where  the  target  function  is  “smoother.”  This  does  not  involve  a  careful  detection  of  edges  and  parti¬ 
tioning  of  X. 


3.  LOCAL  APPROXIMATION  OF  THE  QUERY  FUNCTION 

3.1.  Local  smoothness  classes 

Before  we  can  define  local  smoothness  of  a  function/  :  X  — >  R,  we  must  specify  a  few  more  objects.  Let 
{‘P/tlfclo  be  the  eigenfunctions  of  the  Laplace-Beltrami  operator  A  on  X,  and  the  associated 

eigenvalues,  ordered  in  a  nondecreasing  way  with  !0  =  0  and  £k  — *■  oo  as  k  — *  oo  .  The  space  of  diffusion 
polynomials  up  to  degree  N  is  I/v  :=span{cp^  :  £k<N}.  Further  technical  details  are  discussed  in 
Appendix  A.  Note  that  {<Pi}^Lo  was  replaced  by  a  more  general  orthonormal  basis  for  L2(X,//)  in  Filbir  and 
Mhaskar  (2010,  2011). 

The  object  of  interest  in  approximation  theory  is  the  degree  of  approximation  of  the  target  function: 

EN(f)=  min  || f-P  H^.  (3) 

reilN 

Equation  (3)  measures  the  best  error  achievable  if  one  wishes  to  use  I  l  v  as  the  model  for/,  and  wishes  the 
error  to  be  small  at  each  point  of  X.  It  turns  out  that  the  rate  at  which  the  quantity  Ejff)  decreases  to  0  as 
N  — y  oo  is  closely  related  to  the  smoothness  off.  Thus,  if/ is  smooth  enough  so  that  A'/  £  ^(X)  for  some 
integer  r  >  1,  then 

EN(f)<^  m\L, 

where  c  >  0  is  a  constant.  Here,  ?>(%)  denotes  the  continuous  functions  on  X  endowed  with  the  supremum 
norm.  In  practice,  it  is  quite  common  to  find  an  approximation  of/ by  ad  hoc  means.  This  gives  rise  to  the 
question  that  if  one  finds  EN(f)  {  0  at  a  certain  rate,  is  it  because /inherits  a  certain  smoothness?  Thus,  the 
correct  notion  of  smoothness  required  to  answer  this  question  is  defined  in  terms  of  a  regularization 
functional  (known  in  some  branches  of  analysis  as  the  /G  functional).  If  r  >  1  is  an  integer,  this  functional  is 
defined  by 

Kr(f,8)=M{\\f-g\\00+82r\\A'g\\00},  «5>0, 

where  the  infimum  is  taken  over  all  g  for  which  ||A'g  <  oo.  In  the  terminology  introduced  before,  this 

measures  the  approximation  of/by  smooth  functions  g  while  controlling  the  growth  of  A rg  with  the  help  of 
the  regularization  parameter  8.  It  is  known  (Maggioni  and  Mhaskar,  2008)  that  if  s  >  0  and  r  >  s/2  is  any 
integer,  then 

EN(f)  =  0(N~s)  as  N  — >  oo  if  and  only  if  Kr(f,  8)  =  0  (8s)  as  8  — >  0. 
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Standard  approximation  theory  arguments  show  that  if  y  <  s/2  is  an  integer,  and  j>  =  s  -  2y,  then 
EN(f)  =  0(N~s)  if  and  only  if  A yf  £  -5?(X)  and  Kr{Ayf ,  8)  =  0(8P). 

In  particular,  the  choice  of  r  >  s/2  is  not  critical  except  that  the  constants  involved  in  the  O  relations  will 
depend  upon  the  choice  of  r.  This  leads  us  to  define  the  (global)  smoothness  class  IT'  directly  in  terms  of 
the  quantities  £/(/)  as 

WS(X)  :={/  €  <P(X)  :  EN(f)  =  Q(N~s )}, 

endowed  with  the  norm  ||  •  ||W1  :=  ||  •  H^-t-  ||(A;’£A'(f))wlloo-  We  wiH  think  of  s  as  the  parameter  mea¬ 
suring  the  smoothness  of  the  function  /. 

In  the  context  of  data-dehned  manifolds,  we  do  not  explicitly  know  any  formulas  for  the  local  coordinate 
charts.  Therefore,  it  is  not  easy  to  define  the  notion  of  derivatives.  Nevertheless,  we  can  redefine  the  notion 
of  an  infinitely  differentiable  function  on  our  unknown  manifold  as  membership  in  every  smoothness  class 
VT'(X).  Thus,  the  set  of  all  infinitely  differentiable  functions  is  denoted  by  <^’°°(X)=  P|J>0  IT'(X).  If 
vo  G  X,  we  will  define  the  local  smoothness  of/ at  x0  by  the  natural  windowing  construction.  Thus,  we  say 
that  f  £  WJo(X)if  there  exists  a  neighborhood  U  of  xQ  such  that  for  every  cj)  £  ^“(X),  supported  on  U, 
of  c  IV'(X). 


3.2.  Localized  summation  kernels 

We  can  clearly  expand  any  /  £  Li(X,  /<)  in  the  orthonormal  basis  by /=  o(f’  Ok) Ok-  To 

motivate  our  scheme  presented  later,  we  manipulate  this  expression  without  caring  about  convergence  and 
interchanging  limits,  so  that  we  derive 


f(x)=  I  f(y)Ok(y)dfj-(y)ok(x)=  /  f<y)&(x,yW(y), 


k  =  0  ■ 


(4) 


where  we  formally  use  0(v,  y)=  Y^k=o  Ok(y)Ok(x)-  This  representation  requires  knowledge  of/on  the  entire 
manifold  X.  To  reconstruct /  from  the  data  only,  we  must  replace  the  integral  with  a  finite  sum  over  data 
points  and  localize  the  kernel  O  so  that  fix)  is  determined  by  its  values  in  a  neighborhood  around  x.  In  this 
section,  we  shall  make  these  ideas  mathematically  precise  and  construct  a  linear  operator  based  on  the  data 
to  derive  P  €  I/v  that  essentially  minimizes  Equation  (3). 

First,  we  define 


00  /  £  \ 

® n(x, y)  ■■=  h\  Tj )  <Pk(x)<Pk(y)>  for  a11  y  e  x,  (5) 

k= 0  / 

where  h  :  R  — >  [0,  1]  satisfies  h(t)  =  1  if  |f|  <  1/2  and  h(t )  =  0  if  |f|  >  1,  see  Figure  1  for  a  typical  function  h 
and  Appendix  A  for  a  few  more  technical  conditions.  Due  to  the  “cut-off”  function  h,  Equation  (5)  is  a 


FIG.  1.  Filter  function:  A  typical  filter  function  h  that  can  be  used  in  Equation  (5). 
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finite  sum,  and  <1>V,  as  a  function  of  only  one  variable  x  or  y,  is  contained  in  I  I  v.  Under  the  technical 
assumptions  collected  in  Appendix  A,  the  estimate 


\&N(x,y)\ ^ 


N« 

max(l,  ( Np(x,y))s 


(6) 


holds,  where  a  is  the  dimension  of  X,  p  the  metric  on  X,  and  S  any  integer  (Filbir  and  Mhaskar,  2010; 
Maggioni  and  Mhaskar,  2008;  Mhaskar,  2011).  Here  means  that  there  is  an  absolute  positive  constant  on 
the  right-hand  side  such  that  the  inequality  holds.  This  localization  property  (6),  very  much  like  multi-scale 
approaches  with  wavelets,  enables  local  analysis  of  functions  on  the  manifold. 

Iff  x  P,  then  the  we  also  have  f<S)N(x,  • )  «  FOy (x,  •  )•  Naturally,  we  would  expect  that  the  product  of 
two  polynomials  is  again  a  polynomial  of  a  larger  degree,  so  that  P(\\\'(x,  • )  £  TTaA/,  for  some  fixed  a  >  0. 
Here,  we  only  need  the  weaker  product  assumptions.  (Appendix  A),  implying  that  F<Ev(x,  ■ )  can  be 
sufficiently  well  approximated  with  polynomials  in  n„,v.  To  replace  the  integral  in  Equation  (4)  with  a 
finite  sum,  we  need  a  quadrature  formula  that  is  exact  for  polynomials  up  to  degree  ciN :  If  our  data 
C—  are  sufficiently  dense  in  X  (Appendix  B),  then  there  are  quadrature  weights  {a)i}f=l  such  that 


P(x)dp(x)=  CQjPjxj), 


7=  1 


for  all  P  £  n„,v  ■ 


Since  (po=  1,  these  weights  can  be  computed  by  solving  the  linear  system  of  equations 

M 

C0j(pk(xj)  =  8k'  o,  for  all  A:  =  0,  1,2,  ....  ciN. 

We  refer  to  Appendix  B  and  Filbir  and  Mhaskar  (2010)  for  details  on  the  existence  of  such  quadrature 
weights  with  respect  to  the  training  data. 

We  now  replace  the  right-hand  side  of  Equation  (4)  with 

M 

aN(f,  x)  :  =  (£>jf(yj)^N{x,  yf).  (7) 

7=1 

To  study  the  asymptotics  for  N  — >  co  ,  we  call  for  a  sequence  of  training  sets  C,y  that  induce  quadrature 
formulas  of  strength  ciN.  We  shall  verify  in  Appendix  C  that,  for  f  £  Wv!o(X),  there  is  (5  >  0,  such  that, 

SUP  \f(x)-aN{f,x)\^N~s,  (8) 

x£Bs(yo) 

where  Bsiyo )  C  Xdenotes  a  ball  of  radius  <5  around  y0.  Thus,  when /is  locally  smooth  in  a  neighborhood 
around  y0,  then  we  can  locally  reconstruct  /  from  the  training  data.  The  analogous  result  for  functions  that 
are  globally  smooth  is  contained  in  Filbir  and  Mhaskar  (2010,  2011),  Maggioni  and  Mhaskar  (2008),  and 
Mhaskar  (2011),  which  contains  local  estimates,  but  the  approximand  requires  global  knowledge  of/and  is 
not  purely  defined  through  the  training  data  only. 


4.  SPECIFYING  cpK  NUMERICALLY 

To  design  a  numerically  feasible  algorithm,  we  still  need  to  compute  a  suitable  orthonormal  basis 
{‘Pa-I^o  f°r  p),  while  the  manifold  X  is  not  explicitly  known  to  us.  In  a  typical  situation,  we  may 

have  little  training  data  and  a  much  larger  but  finite  collection  {y,}'^ ,  of  data  lying  on  the  manifold.  These 
data  shall  be  used  to  approximate  an  orthonormal  basis{<^>jt}/</0. 

The  eigenfunctions  of  the  Laplace-Beltrami  operator  Ax  =  divV  on  the  manifold  form  an  orthonormal 
basis  that  satisfies  the  technical  assumptions  needed  in  Appendix  A.  In  order  to  compute  them  numerically, 
we  build  the  graph  Laplacian  from  a  finite  set  of  points  on  the  manifold  as  follows:  By  using  data  points 
{yi}f=i  C  X,  the  standard  heat  kernel  k£(x,y)  =  e~  I*-5,  H  /2e,  e  >  0,  induces  the  weight  matrix  defined 
by  WM.i  j  =  ke(yhyj).  We  build  the  diagonal  matrix  D£Mi  ;=  l  i  j  anc^  define  the  (unnormalized) 
graph  Laplacian  as  L£M  =  yVi'M-D>M.  For  M  — >  o°  and  s  — >  0,  the  eigenvalues  and  interpolations  of  the 
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eigenvectors  of  /^converge  toward  the  “interesting”  eigenvalues  and  eigenfunctions  of  A;<  when 
{ V/}f 1  j  are  uniformly  distributed  on  X.  See  Appendix  D  for  the  details  and  technical  assumptions.  If  , 
are  distributed  according  to  the  density  p.  then  the  graph  Laplacian  approximates  the  elliptic  Schrodinger- 
type  operator  A+  y  (Coifman  et  al.,  2005b;  Nadler  et  al.,  2006),  whose  eigenfunctions  also  form  an 
orthonormal  basis  for  p). 


5.  LOCALIZED  KERNELS  ON  THE  SPHERE 

The  data  of  one  of  our  applications  lie  on  the  sphere  Sd_1,  so  that  we  specify  our  approach  for  X  =  S'1  ~ 1 
(Gia  and  Mhaskar,  2006;  Gia  and  Mhaskar,  2008b).  For  /  :  Sd~l  — >  R,  let  f(x)  :  =f(x/  ||x  ||)  be  its  ex¬ 
tension  to  1RC,\{0}.  The  Laplace-Beltrami  operator  ASd-i  is  ASd-if=(ARjf),Sd-\,  and  the  set  of  spherical 
harmonics  7 idk  is  formed  by  the  homogeneous  harmonic  polynomials  p  on  DT  of  degree  k  restricted  onto  the 
sphere  Sd~l.  In  other  words,  p  is  a  polynomial  whose  monomials  have  all  the  same  total  degree  k 
anddRd/?  =  0.  The  eigenspaces  of  ASd-i  are  7 id  with  eigenvalues  k(k  +  d  -  2),  respectively,  and  the  di¬ 
mension  of  7 id  is  mk  :=  (k+d  l)  -  (A^23)-  ^  we  ch°ose  an  orthonormal  basis  {<pk  ;}™) ,  for  TLdk,  then 
Y^j=  i  (Pkj(x)(Pk,j(y)  does  not  depend  on  the  choice  of  {  (pk  ]}j!L ,  and  coincides  with  Pk((x,y)),  where  Pk  is 
the  Gegenbauer  polynomial  of  degree  k  and  parameter  d/1  -  1,  sec.  Appendix  E.l  and  Stein  and  Weiss 
(1971).  Therefore,  we  can  explicitly  compute 

W,y)=£h(/W±±E)Pl(</C,y)). 

To  derive  oN(f,x)  in  Equation  (7),  we  still  need  to  determine  the  quadrature  weights  {oj7}"=  , ,  which  we  shall 
properly  describe  in  Appendix  E.2.  Here,  we  only  present  a  heuristic  approach.  Since  { Pk((x,  ■ )) )  xeSd  , 
generates  7 idk,  we  aim  to  solve,  for  R  as  large  as  possible, 

n 

ojjPk(( x,  Xj))  =  Sk  o,  for  all  k  =  0,  . . . ,  R, 

7=1 

and  mk  =  dim  CH'I)  random  choices  of  .r  G  Sd  ~  1 .  This  leads  to  a  linear  system  of  equations  whose  solution 
is  reasonably  close  to  exact  weights  in  practice,  but  only  for  small  parameters  d,  n,  and  R.  If  any  of  these 
parameters  is  not  small,  then  the  problem  becomes  numerically  unstable,  and  we  need  to  follow  the 
approach  presented  in  Appendix  E. 

If  the  manifold  is  known,  as  in  the  case  of  the  sphere,  many  types  of  basis  functions  can  be  constructed 
explicitly.  For  instance,  spherical  wavelets  were  considered  in  Antoine  and  Vandergheynst  (1995),  Dahlke 
et  al.  (1995,  2004),  and  Starck  et  al.  (2006),  which  can  capture  multi-scale  structure.  However,  replacing 
the  eigenfunctions  {(pk}^L0  of  the  Laplace-Beltrami  operator  would  require  checking  in  each  case  if  all  the 
conditions  in  Appendix  A  are  satisfied.  Therefore,  we  shall  not  follow  this  path  and,  here,  exclusively  use 
the  eigenfunctions  of  the  Laplace-Beltrami  operator. 


6.  APPLICATIONS 

We  use  the  developed  approximation  scheme  to  cluster  biomedical  data  into  disease  stages  {C/}f=0. 
Therefore,  the  classes  (disease  stages)  have  a  natural  ordering,  and  we  assign  a  meaningful  number  c /  to 
cluster  C/  such  that  0<co  <ci  < . . .  <cl.  Hence,  |c,-  —  Cj\  represents  the  distance  between  C,-  and  Cr  The 
query  function/is  defined  on  the  training  data  by/(x,)  =  q  if  and  only  if  Xj  £  C/.  The  approximand  a N(fx) 
induces  a  nearest  neighbor  classification  on  the  entire  dataset  by  the  proximity  of  nN(fx)  to  any  of  the 
numbers  Co,  . . . ,  ck. 

We  first  compare  our  proposed  approach  to  widely  used  classification  methods  on  two  standard  bio¬ 
medical  datasets.  After  this  verification,  we  use  our  scheme  to  analyze  multispectral  retinal  images  of  age- 
related  macular  degeneration  (AMD)  patients.  All  eye-related  data  were  collected  by  our  collaborators  at 
the  National  Eye  Clinic  at  the  National  Institutes  of  Health  (Bethesda,  NIH;  Maryland). 
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6.1.  Standard  biomedical  datasets 

6.1.1.  Cleveland  Heart  Disease  Database:  learning  disease  stages.  The  Cleveland  Heart  Dis¬ 
ease  Database  (CHDD)  (Detrano  et  al.,  1989)  contains  297  patterns  with  13  attributes  and  is  grouped  into 
five  progressive  heart  disease  stages  (values  0,1, 2,3,4),  where  0  corresponds  to  normal  heart  conditions.  We 
removed  six  patterns  due  to  missing  values.  Due  to  homeostasis  and  its  failure  in  progressed  disease,  we 
expect  the  query  function /to  be  smoother  on  normal  heart  conditions  and  early  disease  stages,  while  later 
stages  may  form  a  more  heterogeneous  group.  We  compare  our  method  to  support  vector  machines  (SVM), 
in  which  clusters  are  derived  through  sequential  binary  clustering.  Clusters  are  evaluated  by  means  of 
binary  false-positive  or  false-negatives  for  each  disease  stage.  Indeed,  our  proposed  method  recovers  / 
consistently  better  than  SVM  for  the  values  0,  1,  and  2  when  dealing  with  few  training  points  (Table  1).  As 
expected,  our  kernel  performs  poorly  on  the  stages  3  and  4,  implying  the  lack  of  smoothness  of  / within 
these  progressed  stages.  The  transition  from  0  to  2  seems  to  be  steered  by  a  smoother  process,  which  is 
reflected  by  a  smoother  query  function  yielding  better  results  than  SVM  methods. 

6.1.2.  Wisconsin  Breast  Cancer  Database:  data  completion.  After  removing  missing  values,  the 
Wisconsin  Breast  Cancer  Database  (WBCD;  original)  (Wolberg  and  Mangasarian,  1990)  contains  683 
patterns  with  9  attributes.  We  aim  to  predict  quantitative  attributes.  In  fact,  we  randomly  select  200,  300, 
and  400  training  points  to  learn  the  attribute  “clump  thickness”  (ranging  from  1  to  10)  and  aim  to  predict 
its  values  on  the  remaining  data.  We  call  it  a  hit  when  the  prediction  is  within  a  radius  of  1,  i.e.,  if  the 
measured  size  was  3,  the  predictions  in  the  interval  [2,  4]  are  counted  as  a  hit.  The  excellent  performance  of 
our  proposed  method  by  means  of  sensitivity  (number  of  hits  divided  by  the  cluster  size)  for  few  training 
data  is  shown  in  Table  2. 

6.2.  Age-related  macular  degeneration 

Age-related  macular  degeneration  is  the  most  common  cause  of  blindness  among  the  elderly  population 
in  the  western  world  (Chew  et  al.,  2009;  Coleman  et  al.,  2008;  Krishnadev  et  al.,  2010).  Aging  of  the  human 
retina  is  universally  associated  with  microscopic  changes  within  the  retinal  pigment  epithelium  (RPE), 
including  increased  number  and  volume  of  fluorescent  lipofuscin  granules  (Meyers  et  al.,  2004).  In  a 


Table  1.  Sensitivity  Analysis  for  Cleveland  Heart  Disease  Database 


Stage  0 

SVM  linear 

SVM  Gaussian 

Local  kernel 

CHDD,  40 

73% 

71% 

79% 

CHDD,  100 

78% 

80% 

83% 

CHDD,  200 

93% 

95% 

92% 

Stage  1 

CHDD,  40 

69% 

68% 

73% 

CHDD,  100 

73% 

75% 

80% 

CHDD,  200 

88% 

92% 

85% 

Stage  2 

CHDD,  40 

64% 

62% 

71% 

CHDD,  100 

68% 

71% 

75% 

CHDD,  200 

86% 

85% 

81% 

Stage  3 

CHDD,  40 

61% 

60% 

54% 

CHDD,  100 

65% 

66% 

59% 

CHDD,  200 

78% 

79% 

69% 

Stage  4 

CHDD,  40 

57% 

55% 

52% 

CHDD,  100 

63% 

59% 

54% 

CHDD,  200 

72% 

69% 

63% 

Sensitivity  (one  minus  false  negative  rate)  for  disease  stages  0  to  4  using  the  dataset  CHDD  in  Section  6.1.1.  with  40.  100,  and  200 
training  points,  averaged  over  50  instances  for  each  method.  Our  local  kernel  method  performs  better  for  few  training  data  than  SVM 
on  disease  stages  0,  1,  and  2.  in  which  we  expect  the  query  function  to  be  relatively  smooth.  CHDD.  Cleveland  Heart  Disease 
Database. 
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Table  2.  Sensitivity  Analysis  for  the  Wisconsin  Breast  Cancer  Database 


SVM  linear 

SVM  Gaussian 

Local  kernel 

WBCD,  100 

74% 

72% 

80% 

WBCD,  200 

76% 

79% 

84% 

WBCD,  300 

92% 

93% 

90% 

Sensitivity  analysis  for  the  data  set  WBCD  in  Section  6.1.2,  averaged  over  clump  thickness  and  50  instances  for  each  method.  Our 
local  kernel  method  yields  high  sensitivity  compared  to  other  methods  when  there  are  only  few  training  data.  WBCD,  Wisconsin 
Breast  Cancer  Database. 


majority  of  Americans  over  the  age  of  60,  the  earliest  clinical  signs  of  RPE  dysfunction  are  observed  in 
color  fundus  photographs  as  drusen — bright  highly  reflective  extracellular  deposits  between  the  RPE  and 
its  basement  membrane.  Macular  drusen  increase  in  number  and  size  with  advancing  age  in  epidemio¬ 
logical  studies  and  larger,  irregular-shaped,  perifoveal  drusen  (“soft”)  are  considered  to  confer  the  greatest 
risk  for  progression  to  advanced  AMD.  Through  many  years  of  large-scale  studies  of  the  natural  history  of 
AMD  and  controlled  prevention  trials,  clinical  observations  of  fundus  photographs  suggest  that  people  with 
soft  (larger  than  150  microns  and  irregularly  shaped)  drusen  are  at  high  risk  for  progressing  to  advanced 
AMD.  Currently,  pathologists  in  reading  centers  classify  drusen  based  on  size  and  shape  (Bird  et  al.,  1995) 
in  reflection  color  fundus  images.  There  is  demand  for  automated  analysis  tools  that  allow  for  quantitative 
prediction  of  disease  progression. 

6.2.1.  Retinal  multi-spectral  imaging.  The  retina  is  a  multilayer  neural  tissue,  uniquely  suited  for 
noninvasive  optical  imaging  with  high  resolution.  The  first  author,  together  with  his  collaborators  at  NIH, 
developed  noninvasive  multispectral  fluorescence  imaging  of  the  human  retina  by  adding  selected  inter¬ 
ference  filter  sets  to  standard  fundus  cameras,  allowing  the  monitoring  of  early  changes  within  the  RPE  via 
the  fluorescent  lipofuscin  granules  (Dobrosotskaya  et  al..  2010,  2011;  Ehler  et  al.,  2010,  2011a,  2011b; 
Kainerstorfer  et  al.,  2010a,  2010b,  201 1).  If  F( A, A)  denotes  the  measured  autofluorescence,  where  A  is  the 
excitation  and  X  the  emission  wavelength,  then  the  Beer-Lambert  law  for  the  double-path  yields 

F( A,  2)  =  /(A)<Z>(A,  ?.)e~(IXA)+im, 

where  D  is  the  integrated  absorbance  (optical  density)  of  the  tissue  the  light  travels  through,  <b(A,7.)  is  the 
fluorescence  efficiency  of  lipofuscin,  and  1(A)  the  radiant  power  of  the  excitation  light.  For  each  of  the  six 
patients,  four  excitation  filters  with  two  emission  filters  and  trifold  imaging  lead  to  24  images  (400  x  400 
pixels)  that  are  aligned  by  applying  the  commercial  software  i2k  Align.  We  de-noise  in  ’-direction  by 
principal  component  analysis,  keeping  five  eigenvectors  capturing  more  than  98%  of  the  dataset’s  variance. 
Spatial  noise  is  reduced  by  averaging  each  pixel  with  its  eight  neighbors.  Since  the  pathological  changes  are 
related  to  fractional  changes  of  the  autofluorescence,  we  can  normalize  the  160,000  vectors  to  lie  on  the 
sphere  .S'1 . 

The  principal  component  analysis  and  normalization  also  helps  to  compare  pixels  across  patients.  The 
drusen  classification  scheme  presented  in  van  Leeuwen  et  al.  (2003;  see  also  Bird  et  al.,  1995)  leads  to  8 
progressing  stages.  We  partition  them  into  four  classes.  An  expert  grader  labeled  few  spatial  regions,  which 
led  to  2,000  training  pixels  in  three  patients  that  were  each  labeled  with  one  of  the  four  partitions.  Thus,  we 
have  6,000  training  vectors  total,  and  pixels  in  a  single  patient  can  have  distinct  labels.  We  shall  learn 
locally  the  pixel  classification  into  drusen  classes.  Note  that  our  method  relies  on  the  multispectral  com¬ 
ponents  of  drusen  in  autofluorescence  images  rather  than  the  spatial  shape  that  is  seen  in  reflection  color 
fundus  images. 

After  the  learning  step,  we  mark  a  region  of  interest  (ROI)  in  six  patients  (including  the  three  patients 
used  for  learning).  To  classify  pixels  from  the  ROI  into  the  four  partitions,  we  merge  the  ROI  pixels  of  the 
patient  under  consideration  with  the  ROI  pixels  of  the  three  learned  patients  to  form  the  data  on  the 
manifold.  The  labeled  regions  are  the  training  data,  and  since  the  pixel  vectors  lie  on  the  sphere,  we  can 
follow  the  approach  in  Section  5  to  derive  oM/,  x)  (Fig.  2).  The  majority  of  the  pixels  within  the  ROI  would 
define  the  drusen  class  of  the  patient.  It  should  be  mentioned  that  the  size  of  the  ROI  influences  the 
classification  scheme,  and  we  obtained  good  results  with  the  center  in  the  fovea  and  extending  to  5  degrees, 
consistent  with  common  image  analysis  in  ophthalmology. 
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FIG.  2.  Drusen  classification.  The  foveal  class  outnumbers  the  other  classes  in  (d)  and  determines  the  patient’s  class, 
(a,  left)  Drusen  in  color  fundus  image;  (a,  right)  Drusen  in  cross-section  of  a  monkey  retina,  (b)  Two  spectral  images  of 
pre-advanced  AMD  patient,  (c)  Two  spectral  images  of  advanced  AMD  patient,  (d,  left)  Squares  show  class  centers;  (d, 
right)  four  classes,  where  dark  blue  is  not  assigned  to  any  class. 


Next,  we  explore  the  influence  of  the  size  of  the  training  data.  We  only  use  a  fraction  of  the  pixels  that 
were  originally  labeled  by  our  grader.  The  individual  pixels  are  selected  by  a  random  sampling.  Figure  3 
shows  that  we  require  a  critical  number  of  training  pixels  and  from  there  on,  we  obtain  stable  classification 
results. 


7.  CONCLUSIONS 

To  facilitate  the  analysis  of  complex  biomedical  data,  we  developed  the  mathematical  foundations  for  a 
numerical  algorithm  that  enables  global  and  local  data  analysis  integratively.  After  we  validated  our 


FIG.  3.  The  increments  of  the 
training  pixels  are  in  steps  of  200. 
The  y-axis  counts  the  correct  clas¬ 
sifications.  Note  that  we  simply  re¬ 
moved  random  pixels  from  the 
entire  set  of  3  X  2000  training  pix¬ 
els,  so  that  each  smaller  training  set 
is  contained  in  the  larger  one.  We 
repeated  this  process  over  20  runs  to 
avoid  random  anomalies,  which  led 
to  the  black  curve.  The  most  typical 
curve  in  a  single  run  is  depicted  in 
red.  For  few  training  data,  we  only 
classify  one  single  patient  correctly. 
Increasing  the  size  of  the  training 
pixels  enables  consistently  correct 
classifications  of  all  but  one  patient 
(consistently  the  same  one).  It 
seems  that  the  critical  number  of 
training  pixels  is  between  600  and 
800  pixels. 
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approach  on  two  standard  datasets,  we  aimed  to  classify  and  predict  disease  progression  in  AMD  patients. 
Drusen  were  classified  in  multispectral  retinal  image  sets  that  enabled  quantitative  measurements  of  ad¬ 
vanced  pathological  changes.  Clearly,  our  new  mathematical  approach  to  identifying  new  components 
controlling  AMD  progression  is  preliminary  and  needs  further  validation  to  claim  its  usefulness  in  general 
terms.  We  anticipate  improvements  through  synergies  of  multiple  analysis  schemes  with  multi  modal  data 
so  that  our  proposed  analysis  could  be  part  of  an  iterative  process. 


8.  APPENDIX 

Appendix  A.  Technical  assumptions 

For  the  sake  of  readability,  we  list  simple  conditions  used  in  Section  3  that  can  be  further  weakened 
(Filbir  and  Mhaskar,  2010;  Maggioni  and  Mhaskar,  2008).  The  manifold  X  is  supposed  to  be  a  smooth, 
compact,  and  connected  Riemannian  manifold  without  boundary.  We  point  out  that  the  following  technical 
conditions  are  satisfied  when  {(Pk\T=o  are  the  eigenfunctions  of  the  Laplace-Beltrami  operator  on 
X,  hs  eigenvalues,  and  /<  the  standard  Riemannian  probability  measure  (see  Filbir  and  Mhaskar, 

2010,  2011;  Maggioni  and  Mhaskar,  2008;  Sikora,  2004  for  details). 


A.l.  Assumption  on  the  volume  of  balls.  If  a  >  0  is  the  dimension  of  X,  then  we  assume  that 
p(Bfx))<rx,  for  all  x  €  X  and  r  >  0. 

A.l.  Assumption  on  products  of  polynomials.  There  is  a  >  2  such  that,  for  all  f,g£  ITy.  their 
product  fg  is  contained  in  fl„v.  In  fact,  we  only  need  the  weaker  condition  saying  that,  for 

eN  :=  sup  dist(<p7<pjt,  n^iv)^, 

tj,tk<N 

we  have  Nc€n  — >  0  as  N  — >  °°  ,  for  every  c  >  0. 

A.  3.  Assumption  on  the  growth  of  the  heat  kernel.  For 

OO 

K,(x,  y )  =  ]Texp(-  ijfcpfixfpfy), 

I=o 

let  |9 ‘’Kt(x,  y)|  <  t~rjJ2~y  exp(-  cp(x,  y)2/t),for  all  t  £  (0,  1],  x,  y  £  X,  where  c  is  a  constant,  y  =  0,  1,  and  9 
is  a  differential  operator  of  first  order. 


A. 4.  Assumption  on  the  filter.  Let:  R>o  — >  R  be  infinitely  often  differentiable  and  a  nonincreasing 
function  such  that  h(t)  —  1  if  t  <  1/2  and  hit)  =  0  if  t  >  1.  For  instance,  we  can  choose 


see  Figure  1. 


h(x)  = 


1, 


exP  ( 

0, 


(.v-h2(2.v2-2.v-l) . 

A‘“  (.V  —  1  )2  ' 


x<  1/2, 
l/2<x<  1, 
1  <x. 


Appendix  B.  Existence  of  quadrature  weights 

For  training  data  C={xj}J=1  C  X,  let  <5 c  :  =  supteX  dist  p(x,  C )  and  qc  :=  min(7y  p(x,,Xj).  If  X  is  well 
covered  by  C,  i.e.,  5c<2qc<25c,  then  there  exists  a  constant  c  with  the  following  property:  for  N <c/ Sc, 
there  are  positive  numbers  {<U/}J=1,  such  that  the  cubature  formula  fxP(x)dp(x)=  Yp=  l  co/P(*/)  holds,  for 
all  P  £  IIa?  (Filbir  and  Mhaskar,  2010).  Note  that  the  weights  satisfy  Yp=i  coi~  1  anc^  hence  are  bounded 
since  <p0=  l. 


Appendix  C.  Proof  of  local  approximation 

Given  /  £  Wvso(X),  pick  <5  >  0  such  that  </>/  £  W'(X),  for  any  (f>  £  C°°(X),  supported  on  B3^(y0)  and 
equals  one  on  B3s/2(yo).  We  can  choose  <fi  such  that  4>(X)  C  [0,  1].  To  verify  Equation  (8),  we  assume  N>  | 
and  estimate. 
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SUP  \f{x)-aN(f,x)\<  sup  \f(x)-<jN(<j)f,x)\+  sup  \oN(<t>f-f)\ 

x£Bs(y0)  xeBs(y0)  xeBs(y0) 

<  sup  \<j>(x)f(x)-aN(<j>f,x)\+  sup  \aN{(f)f-f,x)\ 

x£Bs(y0)  x£Bs(y0) 

<N~S+  sup  \aN(<j>f-f,x)\, 

xeBstvo) 


where  we  have  used  global  approximation  results  from  Mhaskar  (2010),  applied  to  the  globally  smooth 
function  </>/.  In  the  remainder  of  this  section,  we  shall  estimate  supxeBi(Vo)  |caj(</>/-/,  a)|A-s. 

To  do  so,  we  use  the  localization  property  of  <I>W,  i.e.,  for  S  >  max(l,  a  +  s). 


l<M*.  >01 


N“ 

max(l,  ( Np(x,y))s 


(Filbir  and  Mhaskar,  2010;  Maggioni  and  Mhaskar,  2008;  Mhaskar;  2011).  This  kernel  localization  yields, 
for  x  G  Bs(y0), 


M#-/>  *)l~  X!  ®/'(!  -  fKxj))f(Xj) 
j= o 


N* 

max  (1,  ( Np(x ,  Xj)f 


We  split  the  sum  into  two  parts,  X\  =  {j  :  Xj  /2CV0)}  and  the  remaining  indices,  on  which  the  summands 
vanish.  Thus,  we  obtain 


M#-/,  *)ls  Y  "X1  “  ^(v/))/(.\>) 


Nx 


je i,  max  (1,  (Np(x,  Xj)) 

Since  N>  i  and  p(x ,  xj)  >  S/2  on  the  range  that  we  consider,  we  obtain 


s  • 


M#-/>  *)l~  Y  “X1  -  (Kxj))f(xj) 


N' 


a-S 


\S * 


je h  P(x -  Xj) 

The  ball  B^ix)  is  contained  in  B3s/2(yo)  so  that  X3  =  {j  :  xj  B$/ 2(x)}  yields 

Na~s 


kiv(#-/,  *)l~  “X1  “  <P(xj))f{xj) 

jeh 

;sup  (f(xj))N*-sJ2 


jeJi 


jeli 


P(x>  xj) 


m«~sY 


CD; 


jel2  P(x,  Xj) 


where  we  have  used  that  /  is  bounded.  To  finalize  the  proof,  we  can  apply  the  crude  estimate 
Ylje i2  p/x  f  —  ( I )  ’  which  follows  from  icoj=^-  The  relation  a  -  S  <  -  s  implies  the  desired  in¬ 
equality  supxeBi(vo)  \aN(4)f-f,  x)\^N~\  where  the  constant  may  depend  on  /,  y0,  <5,  a,  and  5. 


Appendix  D.  Computing  eigenfunctions 

The  graph  Laplacian,  as  an  operator  on  smooth  functions,  is 

j  /  M  M 

(£m/)M  =  77  (  Yk^x' yj)f(yj) -fix)  Y k‘-(x, yj) 

V7  = 1  7=1 

and,  when  {}’i}  f=  |  is  uniformly  distributed,  CfM  converges  almost  surely  toward  Ax  as  M  tends  to  infinity 
and  e  to  zero  (Lafon,  2004;  Nadler  et  al.,  2006;  Singer,  2006).  Let  4,  ,  and  g!jy  ;  be  the  r'th  eigenvalue  and 
eigenfunction  of  ( Itie)1-01^2 ClM ,  respectively,  where  a  is  again  the  dimension  of  the  underlying  manifold  X. 
If  A,-  and  <p,  are  the  ith  eigenvalue  and  eigenfunction  of  Ax,  then  according  to  Belkin  and  Niyogi  (2006), 
there  exists  a  sequence  sM  — *■  0  such  that,  in  probability. 
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lim  U 

M— >oo 


£m 
M,  i 


1/|  =  0, 


]im  I \gMJ-(Pi 


M—>oc 


\l2 


=  0. 


In  order  to  compute  g gM  ;,  we  need  to  relate  the  spectrum  of  C£M  with  the  spectrum  of  the  matrix  CJM 
introduced  in  Section  4.  It  was  shown  in  von  Luxburg  et  al.  (2008)  that,  for  fixed  s  and  M,  the  “interesting” 
eigenvalues  and  eigenvectors  of  C‘M  are  in  a  one-to-one  relationship  with  the  “interesting”  eigenvalues  and 
eigenfunctions  of  C‘M,  respectively.  In  fact,  if  g  is  any  eigenfunction  of  C'M  with  eigenvalue  1,  then  the 
sampling  v=(g(yi),  •  •  • ,  8(yM))T  is  an  eigenvector  of  CM  with  eigenvalue  MX.  If  X  is  in  the  discrete 
spectrum  of  C:M,  then  g  is  of  the  form 


E"i  kE(x,yj)-MX' 


(9) 


Conversely,  if  v  is  an  eigenvector  of  CM  with  eigenvalue  MX  such  that  X  is  not  in  the  essential  spectrum  of 
C‘M.  then  g  defined  by  Equation  (9)  is  an  eigenfunction  of  CM  with  eigenvalue  X.  Note  that  then  g 
interpolates  v,  i.e.,  (g(yi),  . . . ,  g(yM))T  =  v.  Thus,  we  diagonalize  CM  for  large  M  and  small  e,  and  inter¬ 
polate  via  Equation  (9),  which  yields  an  approximation  of  the  eigenfunctions  of  Ax. 

Note  that  the  relation  between  eigenvectors  of  C'M  and  CMf  described  above  are  independent  on  the 
distribution  of  [ .  This  is  useful  if  {  are  not  uniformly  distributed  but  distributed  according  to  the 
density  p.  In  this  case,  C'Mf  approximates  the  Fokker-Planck  operator  A+  —  (Coifman  et  al.,  2005b;  Nadler 
et  al.,  2006). 


Appendix  E.  Polynomials  needed  for  the  example  on  the  sphere 
E.l.  Gegenbauer  polynomials.  The  Gegenbauer  polynomials 


U/2J 


e=o 


r(k-£+s) 

r(s)£\(k-2iy. 


(2x)‘ 


k~2t 


are  orthogonal  polynomials  on  the  interval  [-1,  1]  with  respect  to  the  weight  function  (1 
r  is  the  usual  Gamma  function. 


x2)s  w.  where 


E.2.  Orthonormal  basis  for  7idk.  In  the  following,  we  shall  present  an  explicit  basis  {(pk  ,}■”*]  for  Hdk 
such  that  we  can  solve  f2j=  i  tyVk  i(xj)  =  ^k,  o»  f°r  all  i—  1,  . . . ,  mk  and  ^=0,  so  that  N=  Ylk=o  mk- 

For  m,  seN,  and  d  >  1,  let 


Gm(X(d))  '■ 


LV  (-  1)'  I-^-dI2'-^  2i/(m-2i)l 

(2,  2)i(d—  \+2s,  2), 


where  ( a ,  b)i  =  a(a  +  b)  ■■■  (a +  (i-  l)t>)  with  the  convention  (a,  b) 0  =  1  and  X(d)  =  (x i,  ■  •  • ,  xf).  The  collec¬ 
tion  of  polynomials 

Gv{x(d))  '■  =GvJl_V2{x(d))  ■  ■  ■  Gvvdd  i_Vd{x(d- 2}), 

for  v  £  Nd,  v\>V2>  ■  ■  ■  >Vd  =  0,  1,  forms  an  orthonormal  basis  for  Tid  (Karachik,  1998).  This  basis  in 
hand,  we  can  find  the  weights  {(Oj}J=1  that  are  needed  to  define  a  iff.  x).  For  quadratures  on  S 2,  see  Filbir 
and  Themistoclakis  (2008),  and  for  other  bases  of  'Hd.  see,  for  instance,  Dunkl  and  Xu  (2001).  Note  that  the 
polynomial  expression  of  Gr(X(ti})  can  be  computed  using  computer  algebra  software.  Evaluating  these 
polynomials  at  our  data  points  lead  to  round-off  errors  that  can  be  controlled  applying  standard  three-term 
relations  of  orthogonal  polynomials. 
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